1 Overview

Bike-sharing systems are a sustainable and flexible transportation solution that allows people to borrow bicycles from designated stations throughout a city. A central challenge for these systems is ensuring that bikes are available when and where they are needed. Meeting demand that varies by time and location requires efficiently redistributing bicycles across stations within a given time window, a process known as rebalancing.

Rebalancing aims to prevent docks from being either completely full or empty, thus maintaining operational efficiency and user satisfaction. The bike-sharing system performs more effectively when high-demand stations are stocked with bikes to meet user needs, while low-demand stations maintain a suitable number of bikes to ensure availability. Relocating bikes effectively therefore requires accurate prediction of bike share demand.

This exercise aims to enhance rebalancing using predictive machine learning models and Ordinary Least Squares (OLS) regression that can forecast short-term demand based on spatial patterns and time lag effects in Jersey City. Moreover, discounts or rewards can be offered to users to encourage them to return bikes to docks that are likely to be empty or to rent from docks that are near capacity. Since the model is time-space dependent, historical data, weather data, and lag-effect variables will be considered and blended into the model.

The dataset utilized in this analysis is sourced from NYC OpenData and contains comprehensive records of Citi Bike usage, updated monthly. While Jersey City is situated in New Jersey, Citi Bike’s operations are primarily centered in New York City, which manages and releases the dataset. The temporal scope of our study focuses on the period from March 1, 2024, to April 1, 2024, exclusively examining bikeshare data within Jersey City. This dataset principally comprises the locations and names of the stations where users initiate and conclude their journeys. Additionally, to enrich our spatial analysis, we procured the census tract geometries for all of Hudson County, including Jersey City, using the tidycensus package.

Our current bikeshare dataset, while including spatial attributes such as latitude and longitude, isn’t inherently spatial. To rectify this, we plan to spatially enable the dataset by associating each ride’s starting location with the corresponding census tract geometry. Our focus is on the origins of rides, as we aim to predict station capacity based on the demand for bikes. Essential to this process is the exclusion of records lacking precise latitude and longitude data. Additionally, we’ve identified and will need to discard entries with incorrect geocodes.
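As a minimal sketch of this spatial-enabling step, the snippet below assigns each ride's start point to the polygon that contains it and drops rides with missing coordinates or with coordinates outside every polygon. It is written in plain Python for illustration and is not the workflow used in this report. The field names `start_lng` and `start_lat`, the tract ids, and the square "tracts" are assumptions made for the example.

```python
from typing import Optional

Point = tuple[float, float]   # (longitude, latitude)
Polygon = list[Point]         # ring of vertices; the last vertex connects back to the first


def point_in_polygon(pt: Point, ring: Polygon) -> bool:
    """Ray-casting test: flip 'inside' each time a rightward ray from pt crosses an edge."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_tract(ride: dict, tracts: dict[str, Polygon]) -> Optional[str]:
    """Return the id of the tract containing the ride's start point, or None."""
    lng, lat = ride.get("start_lng"), ride.get("start_lat")
    if lng is None or lat is None:
        return None  # missing coordinates: the record is discarded
    for tract_id, ring in tracts.items():
        if point_in_polygon((lng, lat), ring):
            return tract_id
    return None  # outside every tract: treated as an incorrect geocode


# Toy data: two hypothetical square tracts and four rides.
tracts = {
    "tract_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "tract_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
rides = [
    {"ride_id": "r1", "start_lng": 0.5, "start_lat": 0.5},
    {"ride_id": "r2", "start_lng": 1.5, "start_lat": 0.25},
    {"ride_id": "r3", "start_lng": None, "start_lat": 0.5},  # missing longitude
    {"ride_id": "r4", "start_lng": 9.0, "start_lat": 9.0},   # outside the study area
]
tagged = [(r["ride_id"], assign_tract(r, tracts)) for r in rides]
kept = [(ride_id, tract) for ride_id, tract in tagged if tract is not None]
```

Here `kept` retains only `r1` and `r2`. Real tract geometries can contain holes and multiple parts, which this single-ring test does not handle.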

We created line strings that connect the start and end locations of each ride. For simplicity of visualization, we show only 200 rides. We can see that most of the rides occur within Jersey City.

2 Exploratory Analysis

In this section, we perform some further exploratory analysis of our data set.

2.1 Temporal Analysis

Based on the chart, it appears that there is a distinct pattern in the usage of bike share trips in Jersey City throughout March 2024. The data suggests regular fluctuations that likely correspond to daily cycles of peak and off-peak usage hours. It suggests increased bike share usage during weekdays, consistent with rush hours, and a noticeable drop during the weekends. This indicates that the bike share system is heavily utilized for commuting purposes during the work week, with reduced usage on weekends, possibly when people are less likely to follow work-related routines. This temporal pattern is typical for urban bike share programs, reflecting the ebb and flow of a city’s daily life and weekly cycle.

When visualizing the distribution of bike share trips by station for each hour in March, there’s a significant skew towards the lower end of the trip count spectrum. This means that a large number of the bike share stations have very few trips (close to zero), while a small number of stations have a high frequency of usage, with the maximum observed being 30 rides in an hour for a specific station on a given day.

The histograms depict the mean number of bike share trips per station during different times of day in Jersey City. They reveal a right-skewed distribution, where most stations see few hourly trips, while a small number have higher usage. During the AM Rush, there’s a noticeable number of stations with under 5 trips, reflecting morning commutes. Mid-Day shows a significant decrease in trips, indicating less bike share use. Overnight usage drops further, with most stations nearing zero trips, which is expected during late hours. The PM Rush sees an uptick in trips compared to Mid-Day and Overnight, but not as high as the AM Rush, suggesting a less intense evening commute. Overall, these patterns suggest that certain stations, likely in key locations, have much higher demand, particularly during rush hours, pointing to opportunities for optimizing bike station placement and bike availability.

The chart illustrates the variation in bike share usage throughout the week in Jersey City for March 2024, with distinct patterns emerging between weekdays and weekends. On weekdays, there are two pronounced peaks each day, likely corresponding to the morning and evening rush hours, suggesting a heavy use of bikes for work commutes. The morning peaks are higher than the evening ones, indicating a potential preference for biking to work or a more concentrated start to the workday. Over the weekend, the usage pattern shifts, showing a single, less sharp peak during the midday hours, reflecting a more leisurely use of the bike share system, possibly for recreational activities or errands. This trend points to a dual nature of the bike share system: it serves as a critical component of weekday commute infrastructure and as a recreational amenity during weekends.

The graph compares the total bike share trips in Jersey City during weekdays and weekends in March 2024. The weekday line (blue) shows a bi-modal distribution, with two significant peaks that suggest high trip counts during typical morning and evening rush hours, indicating a strong use for commuting. The weekend line (red) presents a more bell-shaped curve with a single, broad peak that peaks lower than the highest weekday peak. This likely indicates a more leisurely use of bikes, with trips spread out across the day rather than concentrated around rush hours. Overall, bike share usage is significantly higher on weekdays compared to weekends, underscoring its role in supporting the daily work commute. The dramatic difference between the shapes of the two lines emphasizes the change in travel patterns between the workweek and the weekend.

We analyzed our bikeshare dataset by aggregating the number of trips according to the station, time of day, and whether the day was part of a weekday or weekend. Our objective was to discern the bike demand across different periods and conditions. For clarity in our visualization, we concentrated on the most significant data points, selecting the top 50 instances where bikeshare use was highest. The resulting visualization paints a clear picture: PM rush hours witness a substantial spike in bikeshare activity, particularly at specific stations. A noteworthy observation is that weekend rides are less frequent among the top occurrences, only appearing three times within the top 50.
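The aggregation described above can be sketched as a grouped count followed by a top-N selection. This is an illustration in plain Python, not the code behind the report. The hour cut-offs for each time-of-day label are assumptions, since the report does not state them, and the four rides are toy data.

```python
from collections import Counter
from datetime import datetime


def time_of_day(hour: int) -> str:
    # Assumed cut-offs; the report does not give the exact boundaries.
    if 7 <= hour < 10:
        return "AM Rush"
    if 10 <= hour < 15:
        return "Mid-Day"
    if 15 <= hour <= 18:
        return "PM Rush"
    return "Overnight"


def day_type(ts: datetime) -> str:
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    return "Weekend" if ts.weekday() >= 5 else "Weekday"


# Toy rides: (start station, start time).
rides = [
    ("Grove St PATH", "2024-03-05 17:10"),
    ("Grove St PATH", "2024-03-06 17:45"),
    ("Grove St PATH", "2024-03-09 12:00"),
    ("Hoboken Terminal - River St & Hudson Pl", "2024-03-05 23:30"),
]

counts = Counter()
for station, started_at in rides:
    ts = datetime.strptime(started_at, "%Y-%m-%d %H:%M")
    counts[(station, time_of_day(ts.hour), day_type(ts))] += 1

# Keep the 50 largest (station, time of day, day type) combinations.
top = counts.most_common(50)
```

With the toy rides, the largest group is Grove St PATH during the weekday PM Rush, with two rides.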

A closer look at the data uncovers some standout numbers. For instance, the Grove St PATH station registers a remarkable total of 740 rides during the PM rush on weekdays and another significant count of 402 rides overnight during weekdays of March 2024. Similarly, the Hoboken Terminal - River St & Hudson Pl is a hub of activity with a total of 613 rides noted during the overnight hours on weekdays.

2.2 Spatial Analysis

From this visual representation, we can deduce which stations are most frequently used during different times of the day. During the weekday AM and PM Rush hours, certain stations have significantly higher usage; these are likely located in or near business districts or major transit hubs. Mid-Day and Overnight show reduced activity, as indicated by smaller or fewer circles. The weekend maps indicate a more even distribution of rides across the day, with no single time period showing the dominant demand seen during the weekday rush hours.

The observed Moran’s I value for the actual data is indicated by a vertical line in a different color (presumably the one noted in red in the chart’s title), which falls outside the bulk of the simulated distribution. Because the observed Moran’s I lies to the right of the distribution, it suggests positive spatial autocorrelation, meaning that stations with a high number of rides are likely to be clustered near other stations with a high number of rides.
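To make the comparison between the observed statistic and the simulated distribution concrete, here is a minimal sketch of global Moran's I with a permutation test. It is written in plain Python for illustration. The binary adjacency weights, the six stations on a line, and their ride counts are all assumptions for the example; the report does not state which spatial weights were used.

```python
import random


def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def permutation_test(values, weights, n_sim=999, seed=0):
    """Shuffle values across locations and count how often I is at least the observed I."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    # One-sided pseudo p-value; the +1 counts the observed arrangement itself.
    return observed, (at_least + 1) / (n_sim + 1)


# Toy layout: six stations in a row, each adjacent to its immediate neighbours.
n = 6
weights = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
ride_counts = [10, 9, 8, 1, 2, 1]  # busy stations sit next to busy stations

observed_i, p_value = permutation_test(ride_counts, weights)
```

An observed value far in the right tail of the shuffled values (a small pseudo p-value) corresponds to the vertical line falling to the right of the simulated distribution.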

The map illustrates a distinct pattern of clustering among bike share stations with higher demand. Specifically, it reveals two prominent clusters where a select few stations experience an especially high volume of rides. These central nodes of activity are encircled by stations with moderately fewer rides, indicating a gradient of demand within each cluster.

2.3 Space-time Correlation

The animated map presents a dynamic choropleth representation, merging both spatial and temporal aspects of bikeshare usage into a cohesive visualization. It updates every 15 minutes to reflect changes in bike ride patterns across the area of interest.

The Bikeshare Animation for Jersey City

2.4 Weather Analysis

Considering the influence of meteorological conditions on bikeshare demand, we postulate that inclement weather, such as rain, strong winds, or extreme heat, typically leads to a decline in usage. To validate this, we used the riem_measures function to acquire hourly weather data from Newark Airport, which we take as representative of weather patterns across Jersey City, for the period from March 1, 2024, to April 1, 2024. A summary panel, termed weather.Panel, was created to capture key weather variables (temperature, precipitation, and wind speed) on an hourly basis.

The time series charts display hourly weather data for a location near Newark Airport, representative of Jersey City’s conditions, from March 4th to April 1st, 2024. The precipitation chart reveals sporadic rainfall, with periods of no rain interrupted by occasional spikes indicating rain events. The wind speed chart shows variability, with certain hours exhibiting calm conditions and others experiencing significant gusts. Lastly, the temperature chart demonstrates daily fluctuations consistent with the diurnal cycle and changing weather patterns over the course of the month.

3 Space-Time Panel

In the following steps, we created a study panel in which each instance is a unique combination of space and time. In other words, each row represents the number of rides at a particular station during a particular hour.
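A minimal sketch of such a panel, in plain Python for illustration: take every station and hour combination, then fill in the observed ride count, using zero where no ride occurred. The station names, hours, and rides are toy values, and the column names are assumptions.

```python
from collections import Counter
from itertools import product


def build_panel(stations, hours, rides):
    """One row per (station, hour) pair, including pairs with no rides at all."""
    observed = Counter(rides)  # rides is a list of (station, hour) tuples
    return [
        {"station": s, "hour": h, "trip_count": observed.get((s, h), 0)}
        for s, h in product(stations, hours)
    ]


stations = ["A", "B", "C"]
hours = ["2024-03-01 08:00", "2024-03-01 09:00"]
rides = [
    ("A", "2024-03-01 08:00"),
    ("A", "2024-03-01 08:00"),
    ("C", "2024-03-01 09:00"),
]

panel = build_panel(stations, hours, rides)
```

The panel has 3 × 2 = 6 rows even though only two station-hour pairs saw any rides. The explicit zeros matter because a model trained only on observed rides would never learn when demand is absent.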

3.1 Weather Correlation

The red trend lines across the panels indicate that as temperature rises, the mean trip count tends to increase as well, suggesting a positive correlation between warmer temperatures and the use of bikeshare services. This pattern is consistent across all the weeks presented, although the density of points and the slope of the trend lines suggest that the strength of this relationship might vary from week to week.

The red trend lines superimposed on the scatter plots suggest an overall relationship between wind speed and bikeshare usage. However, unlike the previous chart with temperature, the relationship here does not seem as straightforward or consistent across the weeks.

For some weeks, the trend line appears to show a slight positive correlation, implying that trip counts increase with wind speed up to a certain point. This could be non-intuitive as higher wind speeds might typically deter outdoor activities like biking. For other weeks, the trend line is almost horizontal, suggesting little to no correlation between wind speed and trip counts. This could indicate that wind speed has a less direct or more variable impact on bikeshare usage.
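One hedged way to put numbers on these week-by-week trend lines is a Pearson correlation coefficient per week. The sketch below is plain Python for illustration. The weekly values are toy numbers invented for the example, not figures from this analysis.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy data: week -> (hourly weather value, mean trip count in that hour).
by_week = {
    9: ([40.0, 45.0, 50.0, 55.0], [0.10, 0.14, 0.19, 0.22]),
    10: ([42.0, 48.0, 51.0, 60.0], [0.15, 0.13, 0.16, 0.14]),
}

weekly_r = {week: pearson_r(x, y) for week, (x, y) in by_week.items()}
```

A coefficient near zero in one week and clearly positive in another would match a trend line that is nearly flat in some panels and upward-sloping in others.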

3.2 Serial Correlation

If there’s a temporal correlation in the number of bikeshare trips, incorporating time lag features could enhance the accuracy of predictive models. This is based on the premise that the volume of trips during a given hour is likely to be related to the volume in the hours just preceding or following it. Understanding trip patterns from one hour can thus inform us about trip patterns in adjacent hours. To harness this relationship, we’ve computed six distinct time lag variables to aid in our estimations.

Red trend lines indicate the overall direction and strength of the relationship across all instances. For lag intervals like 1 hour (lag1Hour), there’s a strong positive correlation, as shown by the dense cluster of points and the upward-sloping trend line. This suggests that the number of trips is highly predictive of the trip count in the next hour. As the lag intervals increase, the correlation seems to disperse, especially noticeable in lags like 12 hours and 1 day, where the spread of data points is wider, and the trend line is less steep.
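A minimal sketch of how such lag features can be built for one station's hourly series, in plain Python for illustration. Only `lag1Hour` and the 12-hour and 1-day lags are named in the text; the other lag names and the full set of six offsets are assumptions.

```python
# Assumed set of six lags, in hours; only the 1-hour, 12-hour and 1-day lags
# are mentioned explicitly in the text.
LAGS = {
    "lag1Hour": 1,
    "lag2Hours": 2,
    "lag3Hours": 3,
    "lag4Hours": 4,
    "lag12Hours": 12,
    "lag1day": 24,
}


def add_lags(counts, lags=LAGS):
    """counts: hourly trip counts for ONE station, ordered in time, with no gaps."""
    rows = []
    for t, y in enumerate(counts):
        row = {"trip_count": y}
        for name, k in lags.items():
            # The first k hours have no history, so the lag is left missing.
            row[name] = counts[t - k] if t - k >= 0 else None
        rows.append(row)
    return rows


# Toy series of 30 hours where the count equals the hour index.
series = list(range(30))
rows = add_lags(series)
```

Lags have to be computed within each station's own series. Computing them over a table that stacks several stations would leak counts from one station into another.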

4 Spatial Correlation

We’ve seen before that spatial autocorrelation exists here, in that the number of trips at a particular station during a particular time is correlated with the number of trips at nearby stations. This means that if we know the number of trips at a station, we may be able to estimate the number of trips at the station next to it. For each hour, we therefore identify the five nearest neighbors of each station and calculate the average number of rides across those five stations. The results were joined back to our panel dataset. The results indicate that the total number of trips is similar across adjacent census tracts.
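The nearest-neighbor feature can be sketched as follows for a single hour, in plain Python for illustration. It assumes station coordinates in a projected (planar) system so that straight-line distance is meaningful. The seven stations and their counts are toy values.

```python
from math import hypot


def knn_mean(coords, counts, k=5):
    """For each station, average the counts of its k nearest OTHER stations."""
    result = {}
    for s, (x, y) in coords.items():
        # Sort the other stations by distance; ties are broken by station name.
        others = sorted(
            (hypot(x - ox, y - oy), o) for o, (ox, oy) in coords.items() if o != s
        )
        nearest = [o for _, o in others[:k]]
        result[s] = sum(counts[o] for o in nearest) / len(nearest)
    return result


# Toy layout: seven stations spaced one unit apart along a line.
coords = {f"s{i}": (float(i), 0.0) for i in range(7)}
counts = {f"s{i}": i for i in range(7)}  # trips at each station in one hour

neighbor_avg = knn_mean(coords, counts)
```

In a full panel this calculation would be repeated for every hour and the result joined back by station and hour.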

5 Regression Models

Four models of increasing complexity are fitted on an initial training-test split of three weeks and two weeks: a time-only fixed-effect model, a space-only fixed-effect model, a combined time-space model, and a time-space model with lagged features. As can be observed in the following chart, the greatest improvement comes from adding the lagged variables, since events in time (as well as in space) are more closely related to nearby events than to distant ones.
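For readers who want to see what an OLS fit involves, here is a minimal sketch in plain Python that solves the normal equations directly. It is an illustration only, not the fitting routine used for the four models. The function names and the toy data are assumptions. Fixed effects such as hour of day or station would enter as 0/1 indicator columns in `X`, with one level left out as the baseline.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols_fit(X, y):
    """Least-squares coefficients via (X'X) beta = X'y; an intercept column is prepended."""
    Xi = [[1.0] + list(row) for row in X]
    p = len(Xi[0])
    xtx = [[sum(r[i] * r[j] for r in Xi) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(Xi, y)) for i in range(p)]
    return solve(xtx, xty)


def ols_predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in X]


# Toy data generated exactly as y = 1 + 2*x1 + 3*x2, so the fit should recover it.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
beta = ols_fit(X, y)
predictions = ols_predict(beta, X)
```

Moving from a simpler model to a richer one then amounts to adding columns to `X` (for example lagged trip counts) and comparing the prediction error on held-out weeks.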

5.1 Predict for test data

A better notion of how much the fitness of the models improves is given by looking at ridership as a function of time for both the predicted and the actual ridership.

## # A tibble: 8 × 8
##    week data     Regression       Prediction Observed Absolute_Error   MAE sd_AE
##   <dbl> <list>   <chr>            <list>     <list>   <list>         <dbl> <dbl>
## 1     9 <tibble> ATime_FE         <dbl>      <dbl>    <dbl [6,480]>  0.246 0.588
## 2    10 <tibble> ATime_FE         <dbl>      <dbl>    <dbl [15,030]> 0.238 0.430
## 3     9 <tibble> BSpace_FE        <dbl>      <dbl>    <dbl [6,480]>  0.223 0.547
## 4    10 <tibble> BSpace_FE        <dbl>      <dbl>    <dbl [15,030]> 0.192 0.420
## 5     9 <tibble> CTime_Space_FE   <dbl>      <dbl>    <dbl [6,480]>  0.228 0.544
## 6    10 <tibble> CTime_Space_FE   <dbl>      <dbl>    <dbl [15,030]> 0.204 0.413
## 7     9 <tibble> DTime_Space_FE_… <dbl>      <dbl>    <dbl [6,480]>  0.185 0.503
## 8    10 <tibble> DTime_Space_FE_… <dbl>      <dbl>    <dbl [15,030]> 0.161 0.401
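The `MAE` and `sd_AE` columns summarize the absolute errors for each model and test week. A minimal sketch of that calculation in plain Python, using toy observed and predicted values rather than the report's data:

```python
from statistics import mean, stdev


def error_summary(observed, predicted):
    """Mean and sample standard deviation of the absolute prediction errors."""
    abs_err = [abs(o - p) for o, p in zip(observed, predicted)]
    return {"MAE": mean(abs_err), "sd_AE": stdev(abs_err)}


# Toy station-hour values: most hours have zero or one trip.
observed = [0, 1, 2, 0]
predicted = [0.5, 1.0, 1.0, 0.0]

summary = error_summary(observed, predicted)
```

Because hourly counts per station are mostly zero or one, an MAE of around 0.2 trips is small in absolute terms but should be read against that low baseline.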

5.2 Examine Error Metrics for Accuracy

From the following two plots, we can see that regression model 4, which includes the hourly lags, has the best goodness of fit. Model 4 still does not predict perfectly, possibly because of university spring breaks in March. Because spring break dates vary by institution, we did not include holiday or holiday-lag variables in the model. The generally low ridership may be another reason for the under-prediction.

Next, we examine the spatial distribution of the MAE for model 4.

From the following map, we can tell there is no significant cluster of high error, and model 4 generally performs well in the southwestern part of Jersey City, where errors are low. This indicates broadly uniform predictive capability across locations.

There is also no significant difference between the MAE in the two weeks (week 9 and week 10).

5.3 Space-Time Error Evaluation

This series of plots presents a comparison between observed and predicted bike-share ridership. The analysis is broken down by different times of the day and by weekday versus weekend, allowing us to discern patterns of under or over-prediction. It seems that during the AM Rush and PM Rush on weekdays, and Mid-Day on weekends, the predictions are closer to the observed values, suggesting a higher model accuracy during peak usage times.

The scatter plots examine the relationship between three socio-economic factors (median income, percentage using public transport, and percentage of white residents) and the predictive errors. There is a visible trend that as median income and percentage of white residents increase, so does the MAE, suggesting that these socio-economic factors may influence ridership patterns in ways that the model does not fully capture. The percentage of residents using public transportation shows a less clear trend, indicating the need for further investigation into how public transportation usage correlates with bike-share ridership.

5.4 Cross-validation

Finally, 100-fold cross-validation is performed on the full March data, yielding a final MAE of 0.177. This indicates that, although the model cannot predict ridership perfectly, its error is small. However, the small MAE also partly reflects the low ridership at each station.

Mean Absolute Error    Standard Deviation of MAE
0.1774951              0.0182814
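The cross-validation procedure can be sketched as below, in plain Python for illustration. To stay short it uses a one-predictor linear fit (trip count on a single lag) in place of the full time-space model with lagged features, so it shows the fold logic rather than reproducing the figures above. The function names and the noise-free toy data are assumptions.

```python
import random
from statistics import mean, stdev


def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def k_fold_mae(xs, ys, k=100, seed=0):
    """Mean and standard deviation of the per-fold MAE; requires k <= len(xs)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    maes = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        maes.append(mean(abs(ys[i] - (a + b * xs[i])) for i in fold))
    return mean(maes), stdev(maes)


# Toy data: 200 station-hours where the count is an exact linear function of the lag.
lag_counts = [i % 5 for i in range(200)]
trip_counts = [0.5 + 0.8 * x for x in lag_counts]

cv_mae, cv_sd = k_fold_mae(lag_counts, trip_counts, k=100)
```

Each observation is held out exactly once, and the spread of the fold-level MAEs indicates how stable the error is across different subsets of the data.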

6 Conclusion

This project applied machine learning techniques and regression analysis to predict the demand for bike-sharing services in Jersey City, focusing on a time-space model with lagged features. Our findings highlight the role of predictive modeling in optimizing bicycle distribution across the city to increase operational efficiency and user satisfaction.

We observed that predictive accuracy varied according to different times of the day and days of the week. Our model also showed higher accuracy during peak rush hours and lower accuracy during off-peak times and weekends. This variation pointed to the necessity of incorporating diverse time-dependent variables into the model to better predict demand fluctuations.

Adding spatial and temporal lags significantly improved the model’s predictive power. The more complex model can also help ensure that high-demand stations are adequately stocked to meet user needs while low-demand stations are maintained to prevent overflow.

However, the model has some limitations, such as omitted variables like holiday schedules and only partial treatment of weather conditions. Certain periods, such as the university spring break, may reduce the model’s predictive accuracy. Future improvements could incorporate a broader range of factors, including academic calendars and more detailed studies of weather impacts.